Improving Availability in Distributed Systems with Failure Informers

Authors

  • Trinabh Gupta
  • Joshua B. Leners
  • Marcos K. Aguilera
  • Michael Walfish
Abstract

This paper addresses a core question in distributed systems: how should applications be notified of failures? When a distributed system acts on failure reports, the system’s correctness and availability depend on the granularity and semantics of those reports. The system’s availability also depends on coverage (failures are reported), accuracy (reports are justified), and timeliness (reports come quickly). This paper describes Pigeon, a failure reporting service designed to enable high availability in the applications that use it. Pigeon exposes a new abstraction, called a failure informer, which allows applications to take informed, application-specific recovery actions, and which encapsulates uncertainty, allowing applications to proceed safely in the presence of doubt. Pigeon also significantly improves over the previous state of the art in the three-way trade-off among coverage, accuracy, and timeliness.
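As a rough illustration of how an application might consume reports that carry explicit uncertainty, the sketch below models a report with a certainty level and maps it to a recovery action. All names here (`Certainty`, `FailureReport`, `choose_recovery`, and the action strings) are hypothetical and chosen for this example only; they are not Pigeon's actual interface, and the policy shown is just one possible application-specific choice.

```python
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    """Hypothetical certainty levels attached to a failure report."""
    CERTAIN = "certain"      # the reporting service vouches that the failure occurred
    UNCERTAIN = "uncertain"  # the reporting service only suspects a failure


@dataclass(frozen=True)
class FailureReport:
    target: str           # hypothetical identifier of the monitored process
    certainty: Certainty  # how much the application may rely on the report


def choose_recovery(report: FailureReport) -> str:
    """Pick an application-specific action from a report (illustrative policy)."""
    if report.certainty is Certainty.CERTAIN:
        # The failure is vouched for, so the application can recover immediately.
        return "failover"
    # Under doubt, prefer an action that stays safe even if the target is still alive.
    return "wait_or_fence"


if __name__ == "__main__":
    print(choose_recovery(FailureReport("replica-1", Certainty.CERTAIN)))
    print(choose_recovery(FailureReport("replica-2", Certainty.UNCERTAIN)))
```

The point of the sketch is only that the decision lives in the application: the report exposes the doubt, and the application decides what is safe to do with it.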





Publication date: 2013